The growing use of data-driven solutions in sensitive sectors such as healthcare and cybersecurity is often limited by data scarcity and strict privacy standards. This paper explores how generative artificial intelligence (AI) tools, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Gaussian Mixture Models (GMMs), can be used to produce privacy-preserving synthetic datasets that supplement limited real data while maintaining confidentiality. A descriptive-analytical research design was chosen, supported by empirical demonstrations in two areas: healthcare, where tabular patient records were used for disease prediction, and cybersecurity, where benign network traffic flows were used for anomaly detection. The synthetic datasets were evaluated on three key dimensions: fidelity, which gauges similarity to the real data; utility, which gauges performance in downstream machine learning applications; and privacy, which gauges the risk of data memorization or leakage. The findings showed that class-conditional GMMs modeled the distributions of patient features well and improved predictive modeling when combined with real data, and that synthetic benign traffic supported effective anomaly detection in cybersecurity tasks. Privacy evaluations indicated that no individual records were memorized, reducing re-identification risk. Overall, the paper shows that generative AI can deliver high-fidelity, useful, and privacy-conscious synthetic datasets, offering a scalable response to data scarcity while underscoring the importance of rigorous validation, ethical supervision, and governance in sensitive-data applications.
Introduction
The development of AI in sensitive fields like healthcare and cybersecurity is constrained by data scarcity and strict privacy regulations, limiting access to real datasets necessary for effective machine learning. Generative AI models—such as GANs, VAEs, and GMMs—offer a promising solution by producing synthetic data that mimics real data’s statistical properties without compromising privacy.
In healthcare, synthetic patient records can augment small or imbalanced datasets to improve disease prediction models while maintaining confidentiality. In cybersecurity, synthetic benign network traffic can help train anomaly detection systems more effectively, especially where real attack data is scarce or sensitive.
The research critically evaluates generative AI’s role in creating synthetic data, focusing on three key dimensions: fidelity (how closely synthetic data resembles real data), utility (usefulness for downstream tasks), and privacy (risk of sensitive information leakage). Using real datasets, the study applied class-conditional GMMs for healthcare and GMMs for benign traffic in cybersecurity, assessing synthetic data quality via PCA, model performance metrics (accuracy, ROC-AUC), and privacy proxies like nearest-neighbor distances.
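The class-conditional GMM approach and the utility check described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the dataset is a synthetic stand-in generated with `make_classification`, and the component count, classifier, and sample sizes are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in for a real tabular patient dataset (the paper's data is not reproduced here).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def fit_sample(X_tr, y_tr, n_per_class=300, n_components=3, seed=0):
    """Class-conditional GMM: fit one mixture per class, then sample from each."""
    Xs, ys = [], []
    for c in np.unique(y_tr):
        gmm = GaussianMixture(n_components=n_components,
                              random_state=seed).fit(X_tr[y_tr == c])
        samples, _ = gmm.sample(n_per_class)
        Xs.append(samples)
        ys.append(np.full(n_per_class, c))
    return np.vstack(Xs), np.concatenate(ys)

X_syn, y_syn = fit_sample(X_tr, y_tr)

# Utility proxy: train on real + synthetic data, score ROC-AUC on held-out real data.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_tr, X_syn]), np.concatenate([y_tr, y_syn]))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ROC-AUC (real + synthetic training): {auc:.3f}")
```

Sampling each class from its own mixture preserves class-conditional structure and also lets minority classes be oversampled, which is one way synthetic data can help with the imbalanced datasets mentioned above.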
Results show that synthetic data preserves statistical structure and, when combined with real data, enhances model robustness and performance. Privacy assessments suggest the synthetic data does not memorize or replicate individual records, mitigating privacy risks. While synthetic data cannot fully replace real data, it serves as a valuable supplement to overcome data limitations and ethical concerns.
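The nearest-neighbor privacy proxy mentioned above can be sketched as follows. This is an illustrative check on random stand-in data, not the paper's evaluation: the idea is that if a generator memorized training records, some synthetic points would sit at near-zero distance from a real record, well below the typical real-to-real nearest-neighbor distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Illustrative stand-ins for real training data and generator output.
X_real = rng.normal(size=(500, 8))
X_syn = rng.normal(size=(500, 8))

# Distance from each synthetic record to its closest real record.
nn_real = NearestNeighbors(n_neighbors=1).fit(X_real)
d_syn_to_real = nn_real.kneighbors(X_syn)[0].ravel()

# Baseline: real-to-real nearest-neighbor distances (second neighbor skips the self-match).
d_real_to_real = nn_real.kneighbors(X_real, n_neighbors=2)[0][:, 1]

# Memorized or copied records would appear as near-zero syn->real distances.
print(f"min syn->real distance: {d_syn_to_real.min():.3f}")
print(f"median syn->real:       {np.median(d_syn_to_real):.3f}")
print(f"median real->real:      {np.median(d_real_to_real):.3f}")
```

Comparing the two distance distributions, rather than looking at synthetic-to-real distances alone, gives a data-scale-aware baseline for what "suspiciously close" means.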
Overall, the study highlights the potential of generative AI to drive innovation responsibly in data-sensitive domains by balancing fidelity, utility, and privacy, and offers a framework for evaluating synthetic data’s effectiveness.
Conclusion
This paper highlights the transformative potential of generative artificial intelligence (AI) for synthetic data creation, especially in sensitive areas such as healthcare and cybersecurity, where data scarcity and privacy concerns are paramount. Using Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Gaussian Mixture Models (GMMs), the study showed that realistic, non-identifiable synthetic data can be produced to supplement scarce real-world data.
Assessment along the fidelity, utility, and privacy dimensions showed that synthetic datasets preserve the statistical structure of real data, support predictive modeling and anomaly detection, and reduce the risk of individual data disclosure. Class-conditional synthetic records were found to improve disease prediction models in healthcare and to strengthen the robustness of anomaly detection systems against emerging threats in cybersecurity. The results position synthetic data as a strategic complement to, rather than a substitute for, real data, offering prospects for data augmentation, bias mitigation, and safer analysis. However, the paper emphasizes the need for stringent validation frameworks, ethical guidelines, and privacy-preserving methods, such as differential privacy, to mitigate residual risks. Future work should pursue benchmark unification, hybrid generative-privacy systems, and broader cross-domain applications of synthetic data to balance utility and confidentiality.